Key-string algorithm--novel approach to computational analysis of repetitive sequences in human centromeric DNA.
Abstract
AIM To use a novel computational approach, the Key-string Algorithm (KSA), for the identification and analysis of arbitrarily large repetitive sequences and higher-order repeats (HORs) in noncoding DNA. This approach is based on the use of a key string that plays the role of an arbitrarily constructed "computer enzyme".
METHOD A cluster of novel KSA-related methods was introduced and developed by combining computations on a very modest scale, visual inspection, and graphical display of the results of analysis. Sequence analysis software was developed, containing seven programs for KSA-related analyses. This approach was demonstrated in a case study of alpha satellites and HORs in the human genetic sequence AC017075.8 (193277 bp) from the centromeric region of human chromosome 7. The KSA segmentation method was applied by using the DCCGTTT, GTA, and TTTC key strings.
RESULTS Fifty-five copies of a 2734-bp 16mer HOR were identified and investigated, along with the start-string TTTTTTAAAAA. The HOR-matrix was constructed and employed for graphical display of mutations. KSA identification of HORs in AC017075.8 was compared with that of RepeatMasker and Tandem Repeats Finder, which identified alpha monomers in AC017075.8 but not the HORs. On the basis of the KSA study, centromere folding was described as an effect of HORs and super-HORs (3 × 2734 bp) in AC017075.8. The following novel computational KSA-based methods, easy to use and intended for computational "pedestrians", were demonstrated: color-HOR diagram, KSA-divergence method, 171-bp subsequence-convergence diagram, and total frequency distribution of the key-string subsequence lengths. The results were supplemented by a Fast Fourier Transform analysis employing a novel mapping of the symbolic genomic sequence into a numerical sequence.
CONCLUSION The KSA approach offers a simple and robust framework for a wide range of investigations of large repetitive sequences and HORs, involving a very modest scope of computations that can be carried out on a PC. As the KSA method is HOR-oriented, the identification of HORs is even easier than the identification of the underlying alpha monomer itself. This approach provides easy identification of point mutations, insertions, and deletions with respect to the consensus. It may be useful in a wide range of investigations and may be applied in forensic medicine, medical diagnosis of malignant diseases, biological evolution, and paleontology.
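The segmentation idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' software: the function names, the toy sequence, and the majority-vote consensus are assumptions made here for illustration only. The sketch cuts a sequence before every occurrence of a key string (the "computer enzyme"), tabulates the fragment lengths, and lists the substitutions of one repeat copy relative to a consensus. Insertions and deletions would need an alignment step that this sketch does not attempt.

```python
from collections import Counter


def segment_by_key_string(sequence: str, key: str) -> list[str]:
    """Cut `sequence` immediately before every occurrence of `key`.

    Every fragment after the leading one starts with the key string.
    The leading fragment (text before the first occurrence) is omitted
    if it is empty.
    """
    if not key:
        raise ValueError("key string must be non-empty")
    cut_points = []
    pos = sequence.find(key)
    while pos != -1:
        cut_points.append(pos)
        pos = sequence.find(key, pos + 1)  # step by 1: overlaps are found too
    bounds = [0] + cut_points + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def length_frequencies(fragments: list[str]) -> Counter:
    """Frequency distribution of fragment lengths."""
    return Counter(len(f) for f in fragments)


def majority_consensus(copies: list[str]) -> str:
    """Column-wise majority vote over equal-length copies (an assumption;
    the source does not say how its consensus is built)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*copies))


def point_mutations(copy: str, consensus: str) -> list[tuple[int, str, str]]:
    """Substitutions only: (position, consensus base, observed base)."""
    return [(i, c, x) for i, (c, x) in enumerate(zip(consensus, copy)) if c != x]


if __name__ == "__main__":
    # Invented toy data, NOT taken from AC017075.8: a 9-bp unit repeated
    # four times, with one substitution (T -> A) in the third copy.
    unit, mutated = "GTACCTTGA", "GTACCATGA"
    toy = "TT" + unit * 2 + mutated + unit

    fragments = segment_by_key_string(toy, "GTA")
    lengths = length_frequencies(fragments)
    dominant = lengths.most_common(1)[0][0]

    copies = [f for f in fragments if len(f) == dominant]
    consensus = majority_consensus(copies)
    for n, copy in enumerate(copies):
        print(n, copy, point_mutations(copy, consensus))
```

On the toy input, the dominant fragment length is 9 (four fragments), and the only reported substitution is at position 5 of the third copy. With real data, a sharp peak in the length distribution at the repeat length would be the signal of interest. Which key strings produce such a peak is something the sketch leaves to trial and inspection.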
Similar resources
Mining Biological Repetitive Sequences Using Support Vector Machines and Fuzzy SVM
Structural repetitive subsequences are the most important portion of biological sequences and play crucial roles in the corresponding sequence’s fold and functionality. The biggest class of repetitive subsequences is “Transposable Elements”, which has its own sub-classes based on contexts’ structures. Many studies have been performed to critically determine the structure and function of repetitiv...
Novel sequencing strategy for repetitive DNA in a Drosophila BAC clone reveals that the centromeric region of the Y chromosome evolved from a telomere
The centromeric and telomeric heterochromatin of eukaryotic chromosomes is mainly composed of middle-repetitive elements, such as transposable elements and tandemly repeated DNA sequences. Because of this repetitive nature, Whole Genome Shotgun Projects have failed in sequencing these regions. We describe a novel kind of transposon-based approach for sequencing highly repetitive DNA sequences i...
gpALIGNER: A Fast Algorithm for Global Pairwise Alignment of DNA Sequences
Bioinformatics, through the sequencing of the full genomes for many species, is increasingly relying on efficient global alignment tools exhibiting both high sensitivity and specificity. Many computational algorithms have been applied for solving the sequence alignment problem. Dynamic programming, statistical methods, approximation and heuristic algorithms are the most common methods appli...
P-215: Discovery of A Novel APA Variant of A Human Potential Gene Based on Expressed Sequenced Tags Analysis
Background: Expressed sequence tags (ESTs) are sequences of cDNA fragments prepared from different tissue sources. There are over one million of these sequences in the publicly available database, and these sequences are believed to represent more than half of all human genes. The ESTs belong to different cDNA libraries, each prepared from one particular cell type, organ, or tumor. Therefore, th...
Organization and Evolution of Primate Centromeric DNA from Whole-Genome Shotgun Sequence Data
The major DNA constituent of primate centromeres is alpha satellite DNA. As much as 2%-5% of sequence generated as part of primate genome sequencing projects consists of this material, which is fragmented or not assembled as part of published genome sequences due to its highly repetitive nature. Here, we develop computational methods to rapidly recover and categorize alpha-satellite sequences f...
Journal title:
- Croatian Medical Journal
Volume 44, Issue 4
Pages: -
Publication date: 2003